ABSTRACT


Heart disease has been the leading cause of death worldwide over the last
few decades and has become one of the most life-threatening diseases. The
health care industry is rich in data, so there is a need to discover the
hidden patterns and trends in it. For this purpose, data mining techniques
can be applied to extract knowledge from large data sets. In recent times,
many researchers have used machine learning techniques to predict
heart-related diseases because they can predict the disease effectively.
Although machine learning techniques have proved effective in assisting
decision makers, there is still scope for developing an accurate and
efficient system to diagnose and predict heart disease, thereby easing the
work of doctors. This paper presents a survey of various techniques used
for predicting heart disease and reviews their performance.


1. INTRODUCTION

Data mining is the process of examining large databases in
order to discover hidden patterns and correlations using
statistics, machine learning, artificial intelligence and
database technology. A tremendous amount of data is
generated in the medical field, and it is important to mine
this data to help practitioners diagnose disease early.
Heart disease can cause sudden death and claims more lives
each year than all types of cancer or other major diseases.
Heart disease prediction is a very challenging area in the
medical field because of the many risk factors involved,
such as high blood pressure, high cholesterol, uncontrolled
diabetes, abnormal pulse rate and obesity. Highly skilled
and experienced physicians are required to diagnose heart
disease [1]. However, the death rate can be drastically
reduced if the disease is detected at an early stage and
preventive measures are adopted. Developing a heart disease
prediction system is therefore indispensable for preventing
patient deaths through early diagnosis. The main objective
of this paper is to survey previous research on heart
disease prediction and analyze the techniques used.


The paper is organized as follows. Section 2 describes
heart disease. Section 3 explains the data mining
algorithms used for heart disease prediction. Section 4
reviews the literature. Section 5 presents the observations
and Section 6 concludes the paper.


2. HEART DISEASE 
The most common form of heart disease occurs when plaque
develops in the arteries and blood vessels that lead to the
heart. This plaque blocks oxygen and important nutrients
from reaching the heart.
The types of heart disease are listed below:

Congenital heart disease 

Arrhythmia 

Coronary artery disease 

Dilated cardiomyopathy 

Myocardial infarction 

Mitral valve prolapse 

Pulmonary stenosis 

Hypertrophic cardiomyopathy


3. DATA MINING TECHNIQUES FOR PREDICTION

3.1. DECISION TREE 

A decision tree is a supervised machine learning algorithm
that repeatedly splits the data according to the value of a
chosen parameter. The tree consists of two kinds of
entities, namely decision nodes and leaf nodes. The
decision nodes are where the data is split, and the leaf
nodes are the decisions or final outcomes. Decision trees
perform both classification and regression tasks. Tree
models in which the target variable takes a set of discrete
values are called classification trees, and those in which
the target variable takes continuous values are called
regression trees. Decision trees are widely used in data
mining, statistics and machine learning.
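
As a minimal sketch of how a single decision node might choose
its split, the example below scores candidate thresholds by
weighted Gini impurity. The choice of Gini impurity, the function
names and the toy blood pressure values are assumptions made for
illustration; the surveyed papers do not prescribe them.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: resting blood pressure and label (1 = heart disease).
blood_pressure = [110, 120, 130, 150, 160, 170]
disease = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(blood_pressure, disease)
print(threshold, impurity)
```

A full tree would apply the same search recursively to each side
of the split until the leaf nodes are sufficiently pure.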


3.2. NAIVE BAYES 

A Naive Bayes classifier is a probabilistic machine learning
algorithm based on Bayes' theorem for binary and multi-class
classification problems. A Naive Bayes classifier is easy to
build because it assumes that the predictors are independent
of one another, that is, every pair of features is treated
as independent given the class. This assumption rarely holds
in real data. Even so, the classifier often performs
comparably to more sophisticated classification models.
Bayes' theorem is stated as follows:

P(A|B) = P(B|A) P(A) / P(B)

where P(A|B) is the posterior probability of class A given
the observed predictors B.
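
The following sketch works through Bayes' theorem for a binary
class under the naive independence assumption. The function name
and all probabilities are hypothetical values chosen for
illustration, not figures from any of the surveyed datasets.

```python
from math import prod

def naive_bayes_posterior(prior, likelihoods_pos, likelihoods_neg):
    """Posterior P(A | B1..Bn) for a binary class A.

    prior            -- P(A)
    likelihoods_pos  -- [P(Bi | A)] for each observed feature
    likelihoods_neg  -- [P(Bi | not A)] for each observed feature
    Features are assumed independent given the class, so the
    per-feature likelihoods are simply multiplied.
    """
    pos = prior * prod(likelihoods_pos)
    neg = (1.0 - prior) * prod(likelihoods_neg)
    return pos / (pos + neg)  # the denominator is P(B) by total probability

# Hypothetical numbers: 10% of patients have the disease; chest pain is
# reported by 80% of diseased and 20% of healthy patients.
one_feature = naive_bayes_posterior(0.10, [0.80], [0.20])

# Adding a second hypothetical feature (high cholesterol: 70% vs 30%).
two_features = naive_bayes_posterior(0.10, [0.80, 0.70], [0.20, 0.30])
print(one_feature, two_features)
```

With a single feature the function reduces to the formula above;
each additional feature multiplies in one more likelihood term.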


3.3. RANDOM FOREST

Random forest is a supervised learning algorithm that
constructs multiple decision trees and can be used for both
classification and regression. The algorithm builds each
decision tree on a random sample of the data, obtains a
prediction from every tree, and selects the final output by
voting (for classification) or averaging (for regression).
The main advantage of random forest is that it reduces
over-fitting and generally improves accuracy compared with
a single decision tree.
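
The sample-then-vote idea can be sketched as below. This is a
deliberately simplified stand-in, assuming one-split trees
(stumps) trained on bootstrap samples; a full random forest also
grows deeper trees and randomizes the features considered at each
split. All names and data are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-split tree: the (feature, threshold) with fewest training errors."""
    best = None
    for f in range(len(rows[0][0])):
        for t in {x[f] for x, _ in rows}:
            for left_label in (0, 1):
                errors = sum(
                    (left_label if x[f] <= t else 1 - left_label) != y
                    for x, y in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, left_label)
    _, f, t, left_label = best
    return lambda x: left_label if x[f] <= t else 1 - left_label

def train_forest(rows, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        forest.append(train_stump(sample))
    return forest

def predict(forest, x):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (blood pressure,) and label (1 = heart disease).
rows = [((110,), 0), ((120,), 0), ((130,), 0),
        ((150,), 1), ((160,), 1), ((170,), 1)]
forest = train_forest(rows)
print(predict(forest, (100,)), predict(forest, (180,)))
```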


3.4. SUPPORT VECTOR MACHINE 

A support vector machine (SVM) is a discriminative
classifier that finds a hyperplane in n-dimensional space
that distinctly separates the data points. Given labelled
training data, the algorithm outputs an optimal separating
hyperplane that divides the classes. Support vectors are
the data points closest to the hyperplane. SVM is a
supervised learning method used for both classification and
regression tasks. It is a widely used algorithm because it
can achieve high accuracy with comparatively little
computation.
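
To make the roles of the hyperplane and the support vectors
concrete, the sketch below classifies points against a given
hyperplane and picks out the points closest to it. The hyperplane
is written down by hand for a hypothetical toy dataset rather
than learned; training an SVM to find it is outside this sketch.

```python
from math import sqrt

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w||."""
    return abs(decision_value(w, b, x)) / sqrt(sum(wi * wi for wi in w))

# Hypothetical separating hyperplane x1 + x2 - 5 = 0 for four toy points.
w, b = (1.0, 1.0), -5.0
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

labels = [classify(w, b, p) for p in points]
distances = [distance_to_hyperplane(w, b, p) for p in points]
margin = min(distances)
# The support vectors are the points lying at the minimum distance.
support_vectors = [p for p, d in zip(points, distances) if abs(d - margin) < 1e-9]
print(labels, support_vectors)
```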


3.5. ARTIFICIAL NEURAL NETWORK 

A neural network is a machine learning algorithm whose
design is inspired by the human brain. An artificial neural
network (ANN) is an information processing technique that
works in a way similar to how the human brain processes
information. An ANN contains three types of layers: an
input layer, one or more hidden layers and an output layer.
The layers consist of interconnected nodes, each of which
applies an activation function. Neural networks are widely
used in the data mining field for classifying large
datasets.
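
The layered structure can be illustrated with a forward pass
through one hidden layer. The sigmoid activation, the layer
sizes and the weights below are assumptions for illustration
only; in practice the weights are learned, for example by back
propagation.

```python
from math import exp

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each node applies sigmoid(w.x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)

# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.0, -0.1]
out_w = [[1.2, -0.7]]
out_b = [0.05]
output = forward([0.6, 0.9], hidden_w, hidden_b, out_w, out_b)
print(output)
```

The single output can be read as a score between 0 and 1, which
a classifier would compare against a chosen threshold.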


4. LITERATURE REVIEW 

This section describes in detail the previous research on
data mining techniques for heart disease prediction.


Purushottam et al. [2] (2015) designed a system that can
efficiently discover rules to predict the risk level of
heart disease based on the given parameters. The rules are
prioritized based on the user’s requirements. The system
uses covering rules (C4.5Rules) as its classification
model. The WEKA tool is used for dataset analysis and the
Knowledge Extraction based on Evolutionary Learning (KEEL)
tool is used to find the classification decision rules. The
dataset is taken from the Cleveland Clinic Foundation. It
contains a total of 76 raw attributes, of which only 14 are
used. The dataset includes parameters such as ECR,
cholesterol, chest pain, fasting sugar, MHR and many more.
The classification accuracy of the decision tree is 87.4%.


Anchana Khemphila et al. [3] (2011) proposed a
classification approach that uses a Multi-Layer Perceptron
with the Back Propagation learning algorithm and a feature
selection algorithm to diagnose heart disease. Information
gain is used for feature selection. First, the model uses
the ANN without information gain-based feature selection;
the accuracy is 88.46% on the training dataset and 80.17%
on the validation dataset. Next, the ANN is used for
classification after removing the feature with the lowest
information gain. The accuracy is then 89.56% on the
training dataset and 80.99% on the validation dataset. The
result shows that feature selection helps increase
computational efficiency while improving classification
accuracy.
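
Information gain, as used for feature selection here, can be
computed as the reduction in label entropy after splitting on a
feature. The sketch below is a generic illustration with
hypothetical categorical data, not the implementation used in
[3].

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical data: labels (1 = disease) and two categorical features.
labels = [1, 1, 0, 0]
chest_pain = ["yes", "yes", "no", "no"]   # perfectly separates the labels
blood_group = ["a", "b", "a", "b"]        # carries no information

gains = {
    "chest_pain": information_gain(chest_pain, labels),
    "blood_group": information_gain(blood_group, labels),
}
# Features are ranked by gain; the lowest-ranked one is the candidate to drop.
lowest = min(gains, key=gains.get)
print(gains, lowest)
```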


Nikhil Gawande et al. [4] (2017) proposed a system to
classify heart disease using a convolutional neural network
(CNN). The system uses a CNN model to classify ECG signals.
Because the CNN can classify heartbeats into different
classes directly, no separate feature extraction step is
needed. The ECG signals are taken from the MIT-BIH database
and given as input to the system. A total of 340 samples
are used to train and test the CNN, and the results are
reported in a confusion matrix. Once the CNN is trained,
even long ECG records can be classified accurately and
records from different patients can be processed. The
accuracy obtained is 99.46%.


V. Krishnaiah et al. [5] (2014) introduced a fuzzy
classification technique to diagnose heart disease. The
main objective of the research is to predict heart disease
in patients more accurately and to remove the uncertainty
in unstructured data. The data are collected from the
Cleveland Heart Disease database and the Statlog Heart
Disease database. The Cleveland database consists of 303
records and the Statlog database consists of 270 records;
the remaining records are collected from different
hospitals in Hyderabad.


M.A. Jabbar et al. [6] (2016) proposed a novel
classification model, the Hidden Naive Bayes (HNB)
classifier, for heart disease prediction. The heart dataset
is downloaded from the UCI repository. The Heart Statlog
dataset contains 14 attributes and 270 instances. HNB is
evaluated using 10-fold cross validation, and the WEKA tool
is used for HNB classification. The proposed approach
performs pre-processing using discretization and IQR
filters to improve the efficiency of HNB. The performance
results show that the HNB classifier model recorded 100%
accuracy, outperforming the NB classification model.
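
The 10-fold cross validation procedure mentioned above can be
sketched as an index split. The function name is hypothetical and
the folds are taken in order for simplicity; in practice the
records are usually shuffled or stratified first, as WEKA does
internally.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# With 270 instances, each fold tests on 27 records and trains on 243.
folds = list(k_fold_indices(270, k=10))
for train, test in folds:
    # A model would be trained on `train` and scored on `test` here;
    # the reported accuracy is the average of the k test scores.
    pass
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```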


Jayshril S. Sonawane et al. [7] (2014) proposed a system to
predict heart disease using the multilayer perceptron
architecture of neural networks. The dataset is taken from
the Cleveland heart disease database. Its 13 clinical
attributes are fed as input to the neural network, and the
back propagation algorithm is used for training. Compared
with an existing decision support system for predicting
heart disease that uses a multilayer perceptron and the
back propagation algorithm, the proposed system achieves
its highest accuracy of 98.58% with 20 neurons.


Tanmay Kasbe et al. [8] (2017) proposed a fuzzy expert
system for heart disease diagnosis. The data are taken from
the UCI repository, and the dataset consists of 4 databases
from the V.A. Medical Center, Long Beach; the Cleveland
Clinic Foundation; the Hungarian Institute of Cardiology,
Budapest; and the University Hospital, Zurich, Switzerland.
The dataset has a total of 76 input attributes and 1 output
attribute, of which the proposed system uses 10 important
input attributes and 1 output attribute. MATLAB is used for
developing the fuzzy system. The fuzzy system consists of 3
steps: fuzzification, rule base and defuzzification. The
proposed fuzzy expert system achieves 93.33% accuracy and
better performance than previous work in the same domain.
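
The three steps of a fuzzy system can be sketched as follows.
The triangular membership functions, the cholesterol ranges, the
rule outputs and the weighted-average defuzzification are all
hypothetical choices for illustration; they are not the
membership functions or rules of the system in [8].

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a, rising to 1 at b, falling to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy sets for serum cholesterol (mg/dl).
CHOLESTEROL_SETS = {
    "low": (100, 150, 200),
    "medium": (150, 200, 250),
    "high": (200, 250, 300),
}
# Hypothetical rule base: IF cholesterol is <set> THEN risk is <score>.
RULE_BASE = {"low": 0.1, "medium": 0.5, "high": 0.9}

def fuzzify(x):
    """Step 1: map a crisp input to a membership degree in each fuzzy set."""
    return {name: triangular(x, *abc) for name, abc in CHOLESTEROL_SETS.items()}

def infer_risk(x):
    """Steps 2 and 3: fire each rule, then defuzzify by weighted average."""
    memberships = fuzzify(x)
    total = sum(memberships.values())
    if total == 0.0:
        return None  # the input falls outside every fuzzy set
    return sum(m * RULE_BASE[name] for name, m in memberships.items()) / total

print(fuzzify(225), infer_risk(225))
```

An input of 225 belongs partly to "medium" and partly to "high",
so the crisp risk score falls between the outputs of those two
rules.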


Aakash Chauhan et al. [9] (2018) proposed new rules to
predict coronary heart disease using evolutionary rule
learning. Computational intelligence is used to discover
the relationship between disease and patient. The dataset
is taken from the Cleveland Heart Disease database. KEEL, a
Java-based tool, is used for the simulation of evolutionary
learning. The classification model is developed using
association rules and the results are evaluated. The system
helps doctors explore their data and predict coronary
disease accurately.


Senthilkumar Mohan et al. [10] (2019) proposed a novel
method to improve the prediction accuracy of cardiovascular
disease using hybrid machine learning techniques. The
prediction model is designed with different combinations of
features and several known machine learning classification
techniques. The Cleveland Heart Disease dataset is taken
from the UCI repository. After feature selection, 13
attributes are considered for classification. R Studio
Rattle is used for performing classification and the
performance is evaluated. The prediction model uses a
hybrid random forest with a linear model and recorded 88.7%
accuracy.


Sinkon Nayak et al. [11] (2019) proposed a method to
predict heart disease using frequent item mining and
classification techniques. The dataset is taken from the
UCI repository and pre-processed. Frequent item mining is
used for filtering the attributes, and then various
classification techniques, namely Decision Tree, Naive
Bayes, Support Vector Machine and KNN, are used for
predicting heart disease at an early stage. The R
analytical tool is used for implementation. Of these data
mining techniques, Naive Bayes achieves 88.67% accuracy in
predicting heart disease with attribute filtration. The
performance is evaluated using the ROC curve.


Rahma Atallah et al. [12] (2019) proposed a majority voting
ensemble method to predict heart disease in humans. The
model classifies the patient based on the majority vote of
diverse machine learning models in order to provide more
accurate results. The dataset is taken from the UCI
repository and 14 predominant attributes are considered.
Min-Max normalization is applied in pre-processing. To
analyze the data, a correlation value is calculated between
each attribute and the target variable. The features most
highly correlated with the target attribute are Cp,
Thalach, Oldpeak and Exang. Four classifier models are
tested, namely Stochastic Gradient Descent (88%), KNN
(87%), Logistic Regression (87%) and Random Forest (87%).
These 4 models are combined in an ensemble model in which
classification is done by hard voting, and the ensemble
achieves 90% accuracy.
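
Min-Max normalization and hard voting can be sketched as below.
The per-model predictions are invented for illustration and do
not come from the classifiers in [12].

```python
from collections import Counter

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def hard_vote(predictions):
    """Majority label per sample, given one list of labels per model.

    With an even number of models a tie is possible; Counter then
    returns the label it encountered first among the tied ones.
    """
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]

# Hypothetical maximum heart rate values for three patients.
scaled = min_max_normalize([100, 150, 200])

# Hypothetical predictions from four models for the same three patients.
model_predictions = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
ensemble = hard_vote(model_predictions)
print(scaled, ensemble)
```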


Harita Jagad, Jehan Kandawalla et al. [13] (2015) proposed
a model to detect coronary artery disease using different
data mining algorithms, namely Decision Tree, Naive Bayes
and Neural Network. Parameters such as patient age, sex and
blood pressure, recorded during check-ups, are used to
evaluate the performance of these algorithms. Naive Bayes
proved to be the fastest of the three algorithms. The
reliability of the decision tree algorithm depends on the
input data, and it has difficulty dealing with large
datasets. The neural network is mainly used when the
dataset is small. The accuracy level is not clearly
reported.


M. Akhil Jabbar et al. [14] (2013) proposed a model to
classify heart disease using an artificial neural network
and feature subset selection. Feature subset selection is a
method used to reduce the dimensionality of the input data.
Reducing the number of attributes also reduces the number
of diagnostic tests that doctors need from the patient. The
dataset is taken from a hospital in Andhra Pradesh, and the
results show that accuracy improves on traditional
classification techniques. The results also show that the
system is faster and more precise.


Muhammad Saqlain, Wahid Hussain et al. [15] (2016) proposed
a multinomial Naive Bayes algorithm to detect heart
failure. The data are collected from the Armed Forces
Institute of Cardiology (AFIC), Pakistan, in the form of
medical records, and 30 variables are used. The proposed
algorithm is compared with different classification
algorithms, namely Logistic Regression, Neural Network,
SVM, Random Forest and Decision Tree. The performance of
the Naive Bayes algorithm is measured in terms of
precision, accuracy, recall and Area Under the Curve (AUC).
Naive Bayes achieved the highest accuracy of 86.7% and an
AUC of 92.4%.
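
Accuracy, precision and recall, three of the measures named
above, follow directly from the counts in a binary confusion
matrix. The sketch below uses invented labels for illustration;
AUC is omitted because it requires ranked scores rather than
hard predictions.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical true labels and predictions for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

In a diagnostic setting recall matters in particular, because a
false negative corresponds to a missed case of disease.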


5. OBSERVATION 

From the literature review, it is observed that most
datasets were taken from the UCI repository, which contains
the Cleveland Heart Disease database and the Statlog Heart
Disease database. Naive Bayes, decision tree, random forest
and neural network algorithms are the predominantly used
data mining techniques for predicting heart disease and
report the highest accuracy. However, authors should also
concentrate on minimizing computation time. Most of the
research works have used existing machine learning
techniques or combined two or more existing techniques to
improve prediction performance. Rather than voting over or
combining existing algorithms, researchers should propose
new algorithms for prediction. Researchers should also
explore more deep learning concepts in order to improve
prediction performance. Since heart disease is very
dangerous, a fast and reliable system needs to be developed.


6. CONCLUSION 

The main motive for conducting this survey is to understand
the work of different authors and to analyze how accurately
heart disease can be predicted. These authors have proposed
different data mining algorithms, but there are certain
limitations; for example, neural networks perform well only
with structured datasets [16]. MATLAB, Python and WEKA are
the widely used tools for implementing these algorithms.
However, different algorithms achieve different accuracy,
which depends on the size of the dataset, the number of
attributes selected [17] and the tools used for
implementation. Future work should concentrate on proposing
new algorithms that achieve better performance and minimize
computation time.